The wine dataset is a highly popular one in the data science community, as it models some of the challenges of real world datasets and can be modeled by a variety of different model types.
We will first explore the data looking for issues or challenges (i.e. missing data, outliers, possible coding errors, multicollinearlity, etc). Once we have a handle on the data, we will apply any necessary cleaning steps. Once we have a reasonable dataset to work with, we will build and evaluate three different linear models that predict seasonal wins. Our dataset includes both training data and evaluation data - we will train using the main training data, then evaluate models based on how well they perform against the holdout evaluation data. Finally we will select a final model that offers the best compromise between accuracy and simplicity.
The wine training set contains 16 columns - including the target variable TARGET - and 12795 rows, covering a variety of different brands of wine. The data-set is entirely numerical variables, but also contains some variables that are highly discrete and have a limited number of possible values. We believe it is still reasonable to treat these as numerical variables since the different values follow a natural numerical order.
Below, we created a chart that describes each variable in the dataset and the theoretical effect it will have on the number of wins projected for a team.
Variables of Interest
Given that the Index column had no impact on the target variable, number of wines, it was dropped.
We compiled summary statistics on our data set to better understand the data before modeling.
| Name | df |
| Number of rows | 12795 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| numeric | 15 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0 | 1.00 | 3.03 | 1.93 | 0.00 | 2.00 | 3.00 | 4.00 | 8.00 | ▆▇▇▆▁ |
| FixedAcidity | 0 | 1.00 | 7.08 | 6.32 | -18.10 | 5.20 | 6.90 | 9.50 | 34.40 | ▁▂▇▂▁ |
| VolatileAcidity | 0 | 1.00 | 0.32 | 0.78 | -2.79 | 0.13 | 0.28 | 0.64 | 3.68 | ▁▂▇▂▁ |
| CitricAcid | 0 | 1.00 | 0.31 | 0.86 | -3.24 | 0.03 | 0.31 | 0.58 | 3.86 | ▁▂▇▂▁ |
| ResidualSugar | 616 | 0.95 | 5.42 | 33.75 | -127.80 | -2.00 | 3.90 | 15.90 | 141.15 | ▁▂▇▂▁ |
| Chlorides | 638 | 0.95 | 0.05 | 0.32 | -1.17 | -0.03 | 0.05 | 0.15 | 1.35 | ▁▂▇▂▁ |
| FreeSulfurDioxide | 647 | 0.95 | 30.85 | 148.71 | -555.00 | 0.00 | 30.00 | 70.00 | 623.00 | ▁▂▇▂▁ |
| TotalSulfurDioxide | 682 | 0.95 | 120.71 | 231.91 | -823.00 | 27.00 | 123.00 | 208.00 | 1057.00 | ▁▂▇▂▁ |
| Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
| pH | 395 | 0.97 | 3.21 | 0.68 | 0.48 | 2.96 | 3.20 | 3.47 | 6.13 | ▁▂▇▂▁ |
| Sulphates | 1210 | 0.91 | 0.53 | 0.93 | -3.13 | 0.28 | 0.50 | 0.86 | 4.24 | ▁▂▇▂▁ |
| Alcohol | 653 | 0.95 | 10.49 | 3.73 | -4.70 | 9.00 | 10.40 | 12.40 | 26.50 | ▁▂▇▂▁ |
| LabelAppeal | 0 | 1.00 | -0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
| AcidIndex | 0 | 1.00 | 7.77 | 1.32 | 4.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▁▇▁▁▁ |
| STARS | 3359 | 0.74 | 2.04 | 0.90 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
The first observation is the prevalance of NA’s throughout the dataset. Of the 14 feature columns, 8 of them contain at least some NA values. We also see that the TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold” (we would not expect partial cases).
We also note that many of the numerical features measuring the quantity of a chemical in the wine have a negative minimum value. We are assuming the original chemical measurements were normalized (possible a log transform) allowing for negative values, since technically negative concentrations shouldn’t be physically possible. As such, we chose to leave those values as-is.
Next, we wanted to get an idea of the distribution profiles for each of the variables.
We see that most variables have a somewhat normal (although steep) distribution.
The distribution profiles show right skew in variables AcidIndex, and STARS.
More interesting is the shape of many features where they are centered with most values clustered at the center and somewhat uniform shape above and below. It almost appears like a tri-modal distribution with a low, middle and high normal distributions overlapping. We are not going to do extensive feature engineer, or we might consider breaking these features up into 3 separate features each. Two approaches include:
mixTools to separate the multi-modal curves into 3 distinct (and separate) features each capturing just the low, middle or high values and retaining a numerical value.low, middle or high value.In addition to creating histogram distributions, we also elected to use box-plots to get an idea of the spread of each variable.
The box-plots do not reveal any enormous outliers in any of the features, meaning it is unlikely we will need to perform outlier detection and removal. AcidIndex, LabelAppeal, and STARS are essentially categorical in nature (ordinal), so we explore how each value of those features relates with TARGET. We see a clear relationship - as LabelAppeal increases, so does TARGET.
We see the same relationship between STARS and TARGET - we especially note that STARS=NA highly correlates with lower TARGET. In the original project instructions, attention was drawn to the fact that missing data might be informative. Based on this, we will impute STARS=NA into STARS=0 which fits with the other values we see for STARS and the pattern that as stars increase, cases sold increase.
Finally, we wanted to plot scatter plots of each variable versus the target variable, TARGET, to get an idea of the relationship between them.
Due to the discrete nature of the target, it is somewhat difficult to see clear linear relationships in the data. However, it does appear that both STARS and LabelAppeal have a significant positive relationship with the TARGET, and many of the chemical features have at least some negative relationship with the TARGET as lower values led to more values of 8 and 7 in the target variable.
Overall, although our plots indicate some interesting relationships between our variables, they also reveal some significant issues with the data.
For instance, most of the predictor variables are skewed or non-normally distributed, and will need to be transformed. Additionally, there are many data points that contain missing data that will need to be either imputed or discarded. There also was the issue of nonsensical negative values. We have chosen to take the absolute value of these as the feature value as there are so many of these values: however, there is no evidence to back up that this is a correct decision and that these values are not simply missing.
When we initially viewed the first few rows of the raw data, we already noticed missing data. Let’s assess which fields have missing data.
## values ind
## 1 26.25 STARS
## 2 9.46 Sulphates
## 3 5.33 TotalSulfurDioxide
## 4 5.10 Alcohol
## 5 5.06 FreeSulfurDioxide
## 6 4.99 Chlorides
## 7 4.81 ResidualSugar
## 8 3.09 pH
## 9 0.00 TARGET
## 10 0.00 FixedAcidity
## 11 0.00 VolatileAcidity
## 12 0.00 CitricAcid
## 13 0.00 Density
## 14 0.00 LabelAppeal
## 15 0.00 AcidIndex
In the project description, it was noted that the fact that a certain variable is missing may be predictive. We will impute STARS=NA to STARS=0. We will impute the remaining missing data using caret::preProcess and method=knnImpute. Note that preProcess will also center, scale and BoxCox our features at the same time.
With our missing data imputed correctly, we can now build off the scatter plots from above to quantify the correlations between our target variable and predictor variable. We will want to choose those with stronger positive or negative correlations. Features with correlations closer to zero will probably not provide any meaningful information on explaining wins by a team.
## values ind
## 1 0.685381473 STARS
## 2 0.356500469 LabelAppeal
## 3 0.062030498 Alcohol
## 4 0.051730323 TotalSulfurDioxide
## 5 0.043996542 FreeSulfurDioxide
## 6 0.016187709 ResidualSugar
## 7 0.008684633 CitricAcid
## 8 -0.009081197 pH
## 9 -0.035589560 Density
## 10 -0.039072231 Chlorides
## 11 -0.039917146 Sulphates
## 12 -0.049010939 FixedAcidity
## 13 -0.088793212 VolatileAcidity
## 14 -0.221991949 AcidIndex
STARS, LabelAppeal, and Alcohol have the highest correlation with TARGET, which matches what we saw in the variable plots above. Recall that we imputed NA values for STARS as 0. WE note that the missing value dummy indicators do not correlate with TARGET, so these additional columns may not provide much additional predictive power. We will include them for now.
One problem that can occur with multi-variable regression is correlation between variables, or multicolinearity. A quick check is to run correlations between variables.
We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that the assumptions of linear regression are more likely to be met.
To summarize our data preparation and exploration, we can distinguish our findings into a few categories below:
We removed the INDEX field as it offers no information for a model.
For the 8 features with missing values, we created a dummy variable that is 1 if the value is missing and 0 if it is not. We then imputed the missing values as the median of the feature, allowing us to using records with missing values while still including the information that the value is missing.
Many of the numerical features had unreasonable negative values. We transformed these to be the absolute value of the feature, as a data entry error seems likely given the quantity of negative records.
Finally, as mentioned earlier in our data exploration, during the impute step, caret::preProcess() automatically centered, scaled and BoxCox transformed our data. Here are some plots to demonstrate the changes in distributions and final values after the transformations:
We see that after the transformations, the variables are more centered and more closely resemble a normal distribution, although clearly they are still not perfect normal distributions.
With our transformations complete, we can now add these into our clean_df dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.
## [1] "Number of Training Samples: 10238"
## [1] "Number of Testing Samples: 2557"
In this first model, we include all available features. Features include:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2019 -0.6471 -0.0069 0.4413 3.6530
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.181739 0.320017 0.568
## FixedAcidity -0.000411 0.005809 -0.071
## VolatileAcidity -0.021742 0.005744 -3.785
## CitricAcid 0.003765 0.005678 0.663
## ResidualSugar 0.001296 0.005824 0.223
## Chlorides -0.009576 0.005843 -1.639
## FreeSulfurDioxide 0.013779 0.005750 2.396
## TotalSulfurDioxide 0.019508 0.005882 3.317
## Density -0.007099 0.005748 -1.235
## pH -0.008008 0.005786 -1.384
## Sulphates -0.014142 0.005935 -2.383
## Alcohol 0.013964 0.005870 2.379
## as.factor(LabelAppeal)-1.11204793733397 0.254481 0.042766 5.951
## as.factor(LabelAppeal)0.0101741115806247 0.442918 0.041730 10.614
## as.factor(LabelAppeal)1.13239616049522 0.579996 0.042436 13.668
## as.factor(LabelAppeal)2.25461820940981 0.715625 0.047786 14.976
## as.factor(AcidIndex)-3.59682937695875 -0.370876 0.324680 -1.142
## as.factor(AcidIndex)-1.79176983045029 -0.298840 0.317599 -0.941
## as.factor(AcidIndex)-0.545318540973785 -0.319948 0.317282 -1.008
## as.factor(AcidIndex)0.362910765511677 -0.348949 0.317343 -1.100
## as.factor(AcidIndex)1.05172974217783 -0.463021 0.317757 -1.457
## as.factor(AcidIndex)1.59059728918163 -0.600379 0.319144 -1.881
## as.factor(AcidIndex)2.02271372429848 -0.993138 0.323403 -3.071
## as.factor(AcidIndex)2.37629509167962 -1.060688 0.331479 -3.200
## as.factor(AcidIndex)2.67051656830802 -0.870580 0.334070 -2.606
## as.factor(AcidIndex)2.9188445277671 -0.868512 0.344947 -2.518
## as.factor(AcidIndex)3.13100139587667 -0.461426 0.437919 -1.054
## as.factor(AcidIndex)3.31417429494859 -1.140727 0.548705 -2.079
## as.factor(AcidIndex)3.47378568897179 -1.268681 0.548933 -2.311
## as.factor(STARS)-0.42623524866846 0.751031 0.021945 34.223
## as.factor(STARS)0.416552574962037 1.065094 0.020481 52.003
## as.factor(STARS)1.25934039859254 1.185733 0.021509 55.127
## as.factor(STARS)2.10212822222303 1.315180 0.027314 48.150
## Pr(>|z|)
## (Intercept) 0.570100
## FixedAcidity 0.943600
## VolatileAcidity 0.000154 ***
## CitricAcid 0.507247
## ResidualSugar 0.823916
## Chlorides 0.101242
## FreeSulfurDioxide 0.016559 *
## TotalSulfurDioxide 0.000911 ***
## Density 0.216792
## pH 0.166379
## Sulphates 0.017181 *
## Alcohol 0.017366 *
## as.factor(LabelAppeal)-1.11204793733397 0.00000000267 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.253336
## as.factor(AcidIndex)-1.79176983045029 0.346738
## as.factor(AcidIndex)-0.545318540973785 0.313261
## as.factor(AcidIndex)0.362910765511677 0.271508
## as.factor(AcidIndex)1.05172974217783 0.145073
## as.factor(AcidIndex)1.59059728918163 0.059943 .
## as.factor(AcidIndex)2.02271372429848 0.002134 **
## as.factor(AcidIndex)2.37629509167962 0.001375 **
## as.factor(AcidIndex)2.67051656830802 0.009161 **
## as.factor(AcidIndex)2.9188445277671 0.011809 *
## as.factor(AcidIndex)3.13100139587667 0.292030
## as.factor(AcidIndex)3.31417429494859 0.037622 *
## as.factor(AcidIndex)3.47378568897179 0.020823 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18304 on 10237 degrees of freedom
## Residual deviance: 10852 on 10205 degrees of freedom
## AIC: 36482
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5914580 0.5186798 2.2269670 36481.6428640 36720.3602957
In this second model, we only include the most predictive features based on our first Poisson Model. The predictors for the following model are:
VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), family = poisson, data = clean_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2335 -0.6501 -0.0035 0.4410 3.6951
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.001201 0.318675 0.004
## VolatileAcidity -0.023533 0.005122 -4.595
## TotalSulfurDioxide 0.016952 0.005244 3.232
## Alcohol 0.016009 0.005231 3.060
## as.factor(LabelAppeal)-1.11204793733397 0.240289 0.037999 6.324
## as.factor(LabelAppeal)0.0101741115806247 0.430806 0.037064 11.623
## as.factor(LabelAppeal)1.13239616049522 0.564104 0.037710 14.959
## as.factor(LabelAppeal)2.25461820940981 0.699216 0.042446 16.473
## as.factor(AcidIndex)-3.59682937695875 -0.140143 0.322337 -0.435
## as.factor(AcidIndex)-1.79176983045029 -0.097789 0.316887 -0.309
## as.factor(AcidIndex)-0.545318540973785 -0.131342 0.316605 -0.415
## as.factor(AcidIndex)0.362910765511677 -0.162635 0.316637 -0.514
## as.factor(AcidIndex)1.05172974217783 -0.272917 0.316940 -0.861
## as.factor(AcidIndex)1.59059728918163 -0.432064 0.318025 -1.359
## as.factor(AcidIndex)2.02271372429848 -0.795786 0.321596 -2.474
## as.factor(AcidIndex)2.37629509167962 -0.810861 0.327262 -2.478
## as.factor(AcidIndex)2.67051656830802 -0.646459 0.330156 -1.958
## as.factor(AcidIndex)2.9188445277671 -0.738613 0.342710 -2.155
## as.factor(AcidIndex)3.13100139587667 -0.286638 0.403433 -0.710
## as.factor(AcidIndex)3.31417429494859 -0.951704 0.547993 -1.737
## as.factor(AcidIndex)3.47378568897179 -1.196936 0.548071 -2.184
## as.factor(STARS)-0.42623524866846 0.757245 0.019564 38.705
## as.factor(STARS)0.416552574962037 1.075518 0.018261 58.896
## as.factor(STARS)1.25934039859254 1.194174 0.019232 62.093
## as.factor(STARS)2.10212822222303 1.313789 0.024330 53.999
## Pr(>|z|)
## (Intercept) 0.99699
## VolatileAcidity 0.000004335408 ***
## TotalSulfurDioxide 0.00123 **
## Alcohol 0.00221 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000256 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66373
## as.factor(AcidIndex)-1.79176983045029 0.75763
## as.factor(AcidIndex)-0.545318540973785 0.67825
## as.factor(AcidIndex)0.362910765511677 0.60751
## as.factor(AcidIndex)1.05172974217783 0.38918
## as.factor(AcidIndex)1.59059728918163 0.17428
## as.factor(AcidIndex)2.02271372429848 0.01334 *
## as.factor(AcidIndex)2.37629509167962 0.01322 *
## as.factor(AcidIndex)2.67051656830802 0.05023 .
## as.factor(AcidIndex)2.9188445277671 0.03114 *
## as.factor(AcidIndex)3.13100139587667 0.47740
## as.factor(AcidIndex)3.31417429494859 0.08244 .
## as.factor(AcidIndex)3.47378568897179 0.02897 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 13550 on 12770 degrees of freedom
## AIC: 45542
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5911736 0.5169439 2.2257546 45542.3183589 45728.7386026
Similar to Poisson Model 1, the predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df,
## init.theta = 40957.00204, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2219 -0.6496 -0.0055 0.4446 3.6790
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.0275198 0.3190721 0.086
## FixedAcidity 0.0010176 0.0051829 0.196
## VolatileAcidity -0.0231944 0.0051231 -4.527
## CitricAcid 0.0039724 0.0050838 0.781
## ResidualSugar 0.0005767 0.0052060 0.111
## Chlorides -0.0122552 0.0052206 -2.347
## FreeSulfurDioxide 0.0132531 0.0051832 2.557
## TotalSulfurDioxide 0.0170166 0.0052484 3.242
## Density -0.0074549 0.0050938 -1.464
## pH -0.0066655 0.0051875 -1.285
## Sulphates -0.0106282 0.0053149 -2.000
## Alcohol 0.0157805 0.0052355 3.014
## as.factor(LabelAppeal)-1.11204793733397 0.2398534 0.0380017 6.312
## as.factor(LabelAppeal)0.0101741115806247 0.4300463 0.0370666 11.602
## as.factor(LabelAppeal)1.13239616049522 0.5633460 0.0377142 14.937
## as.factor(LabelAppeal)2.25461820940981 0.6992610 0.0424548 16.471
## as.factor(AcidIndex)-3.59682937695875 -0.1651538 0.3226593 -0.512
## as.factor(AcidIndex)-1.79176983045029 -0.1214774 0.3172467 -0.383
## as.factor(AcidIndex)-0.545318540973785 -0.1558113 0.3169954 -0.492
## as.factor(AcidIndex)0.362910765511677 -0.1871144 0.3170413 -0.590
## as.factor(AcidIndex)1.05172974217783 -0.2972766 0.3173791 -0.937
## as.factor(AcidIndex)1.59059728918163 -0.4542592 0.3184811 -1.426
## as.factor(AcidIndex)2.02271372429848 -0.8158352 0.3220684 -2.533
## as.factor(AcidIndex)2.37629509167962 -0.8303961 0.3277299 -2.534
## as.factor(AcidIndex)2.67051656830802 -0.6688133 0.3306330 -2.023
## as.factor(AcidIndex)2.9188445277671 -0.7687131 0.3432641 -2.239
## as.factor(AcidIndex)3.13100139587667 -0.3297889 0.4038365 -0.817
## as.factor(AcidIndex)3.31417429494859 -0.9814037 0.5484760 -1.789
## as.factor(AcidIndex)3.47378568897179 -1.2022430 0.5486104 -2.191
## as.factor(STARS)-0.42623524866846 0.7548590 0.0195728 38.567
## as.factor(STARS)0.416552574962037 1.0732229 0.0182738 58.730
## as.factor(STARS)1.25934039859254 1.1910222 0.0192473 61.880
## as.factor(STARS)2.10212822222303 1.3117031 0.0243440 53.882
## Pr(>|z|)
## (Intercept) 0.93127
## FixedAcidity 0.84435
## VolatileAcidity 0.000005971523 ***
## CitricAcid 0.43458
## ResidualSugar 0.91179
## Chlorides 0.01890 *
## FreeSulfurDioxide 0.01056 *
## TotalSulfurDioxide 0.00119 **
## Density 0.14332
## pH 0.19882
## Sulphates 0.04554 *
## Alcohol 0.00258 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000276 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.60875
## as.factor(AcidIndex)-1.79176983045029 0.70179
## as.factor(AcidIndex)-0.545318540973785 0.62305
## as.factor(AcidIndex)0.362910765511677 0.55506
## as.factor(AcidIndex)1.05172974217783 0.34893
## as.factor(AcidIndex)1.59059728918163 0.15377
## as.factor(AcidIndex)2.02271372429848 0.01131 *
## as.factor(AcidIndex)2.37629509167962 0.01128 *
## as.factor(AcidIndex)2.67051656830802 0.04309 *
## as.factor(AcidIndex)2.9188445277671 0.02513 *
## as.factor(AcidIndex)3.13100139587667 0.41413
## as.factor(AcidIndex)3.31417429494859 0.07356 .
## as.factor(AcidIndex)3.47378568897179 0.02842 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40957) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13529 on 12762 degrees of freedom
## AIC: 45540
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40957
## Std. Err.: 34344
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45472.13
## RMSE Rsquared MAE aic bic
## 2.5906992 0.5182237 2.2252109 45540.1321552 45793.6636866
Similar to Poisson Model 2, the predictors for the following model are:
VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = clean_df, init.theta = 40912.67371,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2456 -0.6517 -0.0038 0.4399 3.6952
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.014317 0.318731 0.045
## VolatileAcidity -0.023451 0.005122 -4.578
## FreeSulfurDioxide 0.013266 0.005179 2.562
## TotalSulfurDioxide 0.016879 0.005244 3.219
## Alcohol 0.016237 0.005231 3.104
## as.factor(LabelAppeal)-1.11204793733397 0.240039 0.038000 6.317
## as.factor(LabelAppeal)0.0101741115806247 0.430314 0.037065 11.610
## as.factor(LabelAppeal)1.13239616049522 0.563287 0.037712 14.936
## as.factor(LabelAppeal)2.25461820940981 0.698162 0.042449 16.447
## as.factor(AcidIndex)-3.59682937695875 -0.154135 0.322401 -0.478
## as.factor(AcidIndex)-1.79176983045029 -0.110580 0.316945 -0.349
## as.factor(AcidIndex)-0.545318540973785 -0.143771 0.316660 -0.454
## as.factor(AcidIndex)0.362910765511677 -0.174856 0.316690 -0.552
## as.factor(AcidIndex)1.05172974217783 -0.285228 0.316994 -0.900
## as.factor(AcidIndex)1.59059728918163 -0.443142 0.318072 -1.393
## as.factor(AcidIndex)2.02271372429848 -0.806122 0.321638 -2.506
## as.factor(AcidIndex)2.37629509167962 -0.819576 0.327296 -2.504
## as.factor(AcidIndex)2.67051656830802 -0.655682 0.330193 -1.986
## as.factor(AcidIndex)2.9188445277671 -0.753474 0.342775 -2.198
## as.factor(AcidIndex)3.13100139587667 -0.299583 0.403483 -0.742
## as.factor(AcidIndex)3.31417429494859 -0.953685 0.548008 -1.740
## as.factor(AcidIndex)3.47378568897179 -1.205542 0.548094 -2.200
## as.factor(STARS)-0.42623524866846 0.756497 0.019567 38.662
## as.factor(STARS)0.416552574962037 1.074730 0.018264 58.844
## as.factor(STARS)1.25934039859254 1.193602 0.019234 62.057
## as.factor(STARS)2.10212822222303 1.314420 0.024331 54.023
## Pr(>|z|)
## (Intercept) 0.96417
## VolatileAcidity 0.000004685011 ***
## FreeSulfurDioxide 0.01042 *
## TotalSulfurDioxide 0.00129 **
## Alcohol 0.00191 **
## as.factor(LabelAppeal)-1.11204793733397 0.000000000267 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.63259
## as.factor(AcidIndex)-1.79176983045029 0.72717
## as.factor(AcidIndex)-0.545318540973785 0.64981
## as.factor(AcidIndex)0.362910765511677 0.58086
## as.factor(AcidIndex)1.05172974217783 0.36823
## as.factor(AcidIndex)1.59059728918163 0.16356
## as.factor(AcidIndex)2.02271372429848 0.01220 *
## as.factor(AcidIndex)2.37629509167962 0.01228 *
## as.factor(AcidIndex)2.67051656830802 0.04706 *
## as.factor(AcidIndex)2.9188445277671 0.02794 *
## as.factor(AcidIndex)3.13100139587667 0.45779
## as.factor(AcidIndex)3.31417429494859 0.08181 .
## as.factor(AcidIndex)3.47378568897179 0.02784 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40912.67) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13543 on 12769 degrees of freedom
## AIC: 45540
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40913
## Std. Err.: 34297
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45486.18
## RMSE Rsquared MAE aic bic
## 2.5909365 0.5174052 2.2255553 45540.1774667 45741.5113299
The predictors for the following model are:
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
##
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9635 -0.8591 0.0325 0.8384 6.0750
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.995095 0.755491 1.317
## FixedAcidity 0.004930 0.011720 0.421
## VolatileAcidity -0.073667 0.011565 -6.370
## CitricAcid 0.014455 0.011558 1.251
## ResidualSugar 0.004027 0.011782 0.342
## Chlorides -0.038485 0.011776 -3.268
## FreeSulfurDioxide 0.039825 0.011786 3.379
## TotalSulfurDioxide 0.049447 0.011816 4.185
## Density -0.022475 0.011545 -1.947
## pH -0.018619 0.011704 -1.591
## Sulphates -0.028248 0.012015 -2.351
## Alcohol 0.050812 0.011830 4.295
## as.factor(LabelAppeal)-1.11204793733397 0.367639 0.062729 5.861
## as.factor(LabelAppeal)0.0101741115806247 0.835185 0.061168 13.654
## as.factor(LabelAppeal)1.13239616049522 1.302062 0.063917 20.371
## as.factor(LabelAppeal)2.25461820940981 1.889951 0.084169 22.454
## as.factor(AcidIndex)-3.59682937695875 -0.334854 0.767938 -0.436
## as.factor(AcidIndex)-1.79176983045029 -0.220221 0.754101 -0.292
## as.factor(AcidIndex)-0.545318540973785 -0.322349 0.753472 -0.428
## as.factor(AcidIndex)0.362910765511677 -0.429363 0.753539 -0.570
## as.factor(AcidIndex)1.05172974217783 -0.732560 0.754117 -0.971
## as.factor(AcidIndex)1.59059728918163 -1.041297 0.755385 -1.378
## as.factor(AcidIndex)2.02271372429848 -1.513113 0.757752 -1.997
## as.factor(AcidIndex)2.37629509167962 -1.533481 0.762156 -2.012
## as.factor(AcidIndex)2.67051656830802 -1.552014 0.769606 -2.017
## as.factor(AcidIndex)2.9188445277671 -1.400154 0.777299 -1.801
## as.factor(AcidIndex)3.13100139587667 -0.692206 0.883131 -0.784
## as.factor(AcidIndex)3.31417429494859 -1.772148 0.952843 -1.860
## as.factor(AcidIndex)3.47378568897179 -1.920432 0.900840 -2.132
## as.factor(STARS)-0.42623524866846 1.346560 0.032920 40.904
## as.factor(STARS)0.416552574962037 2.381720 0.032021 74.381
## as.factor(STARS)1.25934039859254 2.942287 0.037079 79.352
## as.factor(STARS)2.10212822222303 3.629958 0.059150 61.368
## Pr(>|t|)
## (Intercept) 0.18781
## FixedAcidity 0.67401
## VolatileAcidity 0.000000000196 ***
## CitricAcid 0.21109
## ResidualSugar 0.73251
## Chlorides 0.00109 **
## FreeSulfurDioxide 0.00073 ***
## TotalSulfurDioxide 0.000028712970 ***
## Density 0.05160 .
## pH 0.11167
## Sulphates 0.01873 *
## Alcohol 0.000017590098 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000004722 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.66281
## as.factor(AcidIndex)-1.79176983045029 0.77027
## as.factor(AcidIndex)-0.545318540973785 0.66879
## as.factor(AcidIndex)0.362910765511677 0.56883
## as.factor(AcidIndex)1.05172974217783 0.33136
## as.factor(AcidIndex)1.59059728918163 0.16807
## as.factor(AcidIndex)2.02271372429848 0.04586 *
## as.factor(AcidIndex)2.37629509167962 0.04424 *
## as.factor(AcidIndex)2.67051656830802 0.04375 *
## as.factor(AcidIndex)2.9188445277671 0.07168 .
## as.factor(AcidIndex)3.13100139587667 0.43317
## as.factor(AcidIndex)3.31417429494859 0.06293 .
## as.factor(AcidIndex)3.47378568897179 0.03304 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12762 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5429
## F-statistic: 475.9 on 32 and 12762 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.3049443 0.5419395 1.0186507 43106.3022739 43359.8338054
For the final Linear Model, we leverage stepAIC on our Linear Model #5 to choose the most important features.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS),
## data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9616 -0.8590 0.0352 0.8399 6.0675
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.99019 0.75488 1.312
## VolatileAcidity -0.07393 0.01156 -6.394
## Chlorides -0.03864 0.01177 -3.282
## FreeSulfurDioxide 0.04009 0.01178 3.403
## TotalSulfurDioxide 0.04960 0.01181 4.200
## Density -0.02270 0.01154 -1.967
## pH -0.01862 0.01170 -1.591
## Sulphates -0.02837 0.01201 -2.362
## Alcohol 0.05100 0.01183 4.313
## as.factor(LabelAppeal)-1.11204793733397 0.36722 0.06272 5.854
## as.factor(LabelAppeal)0.0101741115806247 0.83483 0.06116 13.649
## as.factor(LabelAppeal)1.13239616049522 1.30161 0.06391 20.367
## as.factor(LabelAppeal)2.25461820940981 1.88998 0.08416 22.456
## as.factor(AcidIndex)-3.59682937695875 -0.33511 0.76744 -0.437
## as.factor(AcidIndex)-1.79176983045029 -0.21824 0.75353 -0.290
## as.factor(AcidIndex)-0.545318540973785 -0.31837 0.75287 -0.423
## as.factor(AcidIndex)0.362910765511677 -0.42442 0.75293 -0.564
## as.factor(AcidIndex)1.05172974217783 -0.72561 0.75343 -0.963
## as.factor(AcidIndex)1.59059728918163 -1.03390 0.75463 -1.370
## as.factor(AcidIndex)2.02271372429848 -1.50407 0.75695 -1.987
## as.factor(AcidIndex)2.37629509167962 -1.52217 0.76131 -1.999
## as.factor(AcidIndex)2.67051656830802 -1.54018 0.76867 -2.004
## as.factor(AcidIndex)2.9188445277671 -1.38512 0.77635 -1.784
## as.factor(AcidIndex)3.13100139587667 -0.67937 0.88243 -0.770
## as.factor(AcidIndex)3.31417429494859 -1.75343 0.95178 -1.842
## as.factor(AcidIndex)3.47378568897179 -1.89498 0.89978 -2.106
## as.factor(STARS)-0.42623524866846 1.34682 0.03291 40.918
## as.factor(STARS)0.416552574962037 2.38256 0.03201 74.442
## as.factor(STARS)1.25934039859254 2.94276 0.03707 79.374
## as.factor(STARS)2.10212822222303 3.63105 0.05914 61.397
## Pr(>|t|)
## (Intercept) 0.189639
## VolatileAcidity 0.000000000167 ***
## Chlorides 0.001034 **
## FreeSulfurDioxide 0.000669 ***
## TotalSulfurDioxide 0.000026926678 ***
## Density 0.049224 *
## pH 0.111585
## Sulphates 0.018191 *
## Alcohol 0.000016245384 ***
## as.factor(LabelAppeal)-1.11204793733397 0.000000004904 ***
## as.factor(LabelAppeal)0.0101741115806247 < 0.0000000000000002 ***
## as.factor(LabelAppeal)1.13239616049522 < 0.0000000000000002 ***
## as.factor(LabelAppeal)2.25461820940981 < 0.0000000000000002 ***
## as.factor(AcidIndex)-3.59682937695875 0.662366
## as.factor(AcidIndex)-1.79176983045029 0.772104
## as.factor(AcidIndex)-0.545318540973785 0.672396
## as.factor(AcidIndex)0.362910765511677 0.572976
## as.factor(AcidIndex)1.05172974217783 0.335525
## as.factor(AcidIndex)1.59059728918163 0.170687
## as.factor(AcidIndex)2.02271372429848 0.046941 *
## as.factor(AcidIndex)2.37629509167962 0.045584 *
## as.factor(AcidIndex)2.67051656830802 0.045124 *
## as.factor(AcidIndex)2.9188445277671 0.074426 .
## as.factor(AcidIndex)3.13100139587667 0.441383
## as.factor(AcidIndex)3.31417429494859 0.065461 .
## as.factor(AcidIndex)3.47378568897179 0.035220 *
## as.factor(STARS)-0.42623524866846 < 0.0000000000000002 ***
## as.factor(STARS)0.416552574962037 < 0.0000000000000002 ***
## as.factor(STARS)1.25934039859254 < 0.0000000000000002 ***
## as.factor(STARS)2.10212822222303 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12765 degrees of freedom
## Multiple R-squared: 0.544, Adjusted R-squared: 0.543
## F-statistic: 525.1 on 29 and 12765 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.3050053 0.5418966 1.0186907 43102.1597330 43333.3208352
This table summarizes the RMSE, \(R^2\), MAE, AIC and BIC for all 6 models. In terms of raw metrics, The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and \(R^2\); however, Poisson Model 1 had the best performance based on AIC and BIC.
Overall, RMSE an \(R^2\) were not largely different across the 6 models, given this we chose Poisson Model 1 as our final model since it had a far lower AIC.
| RMSE | Rsquared | MAE | aic | bic | |
|---|---|---|---|---|---|
| poiss1_eval | 2.591458 | 0.5186798 | 2.226967 | 36481.64 | 36720.36 |
| poiss2_eval | 2.591174 | 0.5169439 | 2.225755 | 45542.32 | 45728.74 |
| nb3_eval | 2.590699 | 0.5182237 | 2.225211 | 45540.13 | 45793.66 |
| nb4_eval | 2.590937 | 0.5174052 | 2.225555 | 45540.18 | 45741.51 |
| lm5_eval | 1.304944 | 0.5419395 | 1.018651 | 43106.30 | 43359.83 |
| lm6_eval | 1.305005 | 0.5418966 | 1.018691 | 43102.16 | 43333.32 |
The final thing we will take a look at is our variable importance for each model.
The final figure presented below shows the feature importance in each model. As we see, all 6 models identified the same top 10 features.
We apply Poisson Model #1 to the holdout evaluation set to predict the TARGET for these instances. We have saved these predictions as csv in the file eval_predictions.csv.
Source code: https://github.com/djlofland/DS621_F2020_Group3/tree/master/Homework_5/eval_predictions.csv
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 0.17715256 -0.26524401 -1.51030924 -0.04455813 -0.4776009 0.11673887
## 2 1.35023830 0.84276407 0.07767213 -1.23934313 -0.7442725 3.49856180
## 3 0.93886293 0.01967235 1.81871194 -0.16055667 -1.1383538 0.03195779
## 4 0.84639842 -0.13861451 -0.28584168 1.73021959 -0.1309278 -0.73421194
## 5 0.01159925 0.68447720 -0.14553811 -0.03295827 -0.1250018 -0.05282329
## 6 1.71495385 1.66585579 -0.36237090 -1.69173745 -0.1190758 1.50777652
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates
## 1 -0.05275591 1.19564455 -0.3402128 2.66647966 0.1211079
## 2 -0.45621338 -0.22730155 -0.1442311 0.23889195 0.6038736
## 3 -0.14689598 -0.19280589 1.9789246 2.06326090 0.1640204
## 4 0.49191169 -0.13675044 -0.2085894 -0.01122314 1.6981424
## 5 0.26328578 -0.29198092 1.3139028 -0.98225823 -0.6405890
## 6 -1.88848742 0.08315942 -1.6483055 -0.21720028 -0.5869484
## Alcohol LabelAppeal AcidIndex STARS
## 1 0.4857435 -1.11204794 -1.7917698 -1.2690231
## 2 1.4782809 0.01017411 -1.7917698 0.4165526
## 3 -0.5202067 0.01017411 0.3629108 -0.4262352
## 4 0.4857435 -1.11204794 0.3629108 -0.4262352
## 5 -1.5261568 0.01017411 1.5905973 -1.2690231
## 6 0.2443154 1.13239616 0.3629108 2.1021282
[[[INSERT R CODE HERE ONCE ALL CHANGE DONE]]]